(This version includes code that Mark developed with a teammate, using ideas, support and experience gained while our solution won 2nd place at the FruitBunchAI hackathon of the TU/e Artificial Intelligence student association, on the mental health cost of corona.)
The code also got support from IBM and other friends who are strong in the data science field.
To avoid confusing the reader: I am a student, not a researcher. The work here may appear on GitHub; this version is our data study project. It does not aim to change any regulation or government rule, so there should be no confusion on that point.
You can read about the topic here:
https://bmrheijligers.medium.com/coronacalm-hackingmentalhealth-749d4a4e7af0
Since the original challenge decided not to do facial recognition, this version aims at corona measures.
import pandas as pd
import numpy as np
import matplotlib
from pyearth import Earth
from pyearth import export
import matplotlib.pyplot as plt
#from jupyterthemes import jtplot
#jtplot.style(theme='onedork')
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
pd.__version__
# np.__version__
print('matplotlib: {}'.format(matplotlib.__version__))
# !wget -N https://data.rivm.nl/covid-19/COVID-19_casus_landelijk.json
# PySpark handles big data well and makes it easy to show the frame.
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
df4 = spark.read.options(header='True', inferSchema='True', delimiter=';').csv('/home/markn/my_project/corona/COVID-19_casus_landelijk.csv')
df4.show(10)
df4.printSchema()
After checking the dataset columns, we need to know exactly which provinces of the Netherlands occur in the data.
df4.select("Province").distinct().show(truncate=False)
df = df4.toPandas()
df.head(10)
# df = pd.read_csv('/home/markn/my_project/corona/COVID-19_casus_landelijk.csv', sep=';', parse_dates=[0, 1], infer_datetime_format=True)
# df.info()
# from pandas.compat import StringIO
# df = pd.read_csv(StringIO(df1), parse_dates=[0, 1])
# df1.info()
df['Date_file'] = df['Date_file'].astype('datetime64[ns]')
df['Date_statistics'] = df['Date_statistics'].astype('datetime64[ns]')
df.info()
df.count()
daterep = 'Date_statistics'
region = 'Municipal_health_service'
cases = 'cases'
deaths = 'Deceased'
# We assign helper names for the date, region, case and death columns.
lastdate = df[daterep].max() - pd.Timedelta('7 days')
df[cases] = 1
df[deaths] = df[deaths].apply(lambda x: 1 if x == 'Yes' else 0)
df
df_geo = df.pivot_table(index=daterep, columns=region, values=[cases, deaths], aggfunc='sum').fillna(0)
df_geo['cases']
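To make the pivot step concrete, here is a minimal sketch on invented data (the column names mirror the RIVM file, the rows are made up): each case row carries a count of 1, and `pivot_table` sums them into a date × region grid, with `fillna(0)` filling combinations that never occur.

```python
import pandas as pd

# Toy line list: one row per reported case (illustrative data, not RIVM's).
toy = pd.DataFrame({
    'Date_statistics': pd.to_datetime(['2020-03-01', '2020-03-01', '2020-03-02']),
    'Municipal_health_service': ['GGD A', 'GGD B', 'GGD A'],
    'cases': [1, 1, 1],
    'Deceased': [0, 1, 0],
})

# Same pivot as above: one row per date, one column per region, summed counts.
toy_geo = toy.pivot_table(index='Date_statistics',
                          columns='Municipal_health_service',
                          values=['cases', 'Deceased'],
                          aggfunc='sum').fillna(0)
print(toy_geo['cases'])
```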
new_index = pd.date_range(df_geo.index.min(), df_geo.index.max() + pd.Timedelta('365 days'))
df_geo = df_geo.reindex(new_index)
df_geo
df_geo['daynum'] = (df_geo.index - df_geo.index.min()).days
df_geo['daynum'].describe()
def gumpdf(x, beta, mu):
"""Return PDF value according to Gumbel"""
expon = - ((x - mu) / beta)
return(np.exp(expon) * np.exp(- (np.exp(expon))) / beta)
def gumcdf(x, beta, mu):
"""Return CDF value according to Gumbel"""
expon = - ((x - mu) / beta)
return(np.exp(- (np.exp(expon))))
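As a sanity check (not part of the original analysis), the numerical derivative of `gumcdf` should match `gumpdf` for arbitrary parameters; the functions are repeated here so the snippet runs on its own.

```python
import numpy as np

def gumpdf(x, beta, mu):
    """Gumbel PDF (same as the function defined above)."""
    expon = -((x - mu) / beta)
    return np.exp(expon) * np.exp(-np.exp(expon)) / beta

def gumcdf(x, beta, mu):
    """Gumbel CDF (same as the function defined above)."""
    expon = -((x - mu) / beta)
    return np.exp(-np.exp(expon))

# The CDF's central-difference derivative should match the PDF everywhere.
beta, mu, h = 12.0, 40.0, 1e-5
x = np.linspace(0.0, 120.0, 7)
numeric = (gumcdf(x + h, beta, mu) - gumcdf(x - h, beta, mu)) / (2 * h)
assert np.allclose(numeric, gumpdf(x, beta, mu), rtol=1e-4)
```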
import matplotlib as mpl
mpl.rc('figure', max_open_warning = 0)
regions = np.sort(df[region].unique())
ind_pos = [2,1,4,8]
regions1 = regions[ind_pos]
## Choose all
# regions1 = regions
regions
# regions1
# Select regions to fit.
regions = regions1
# Choose whether to output plots per region.
showplots = True
# region = 'GGD Brabant Zuid-Oost' and some other important areas; I think Brabant Zuid-Oost is a good choice.
# We measure where the cases go up and down and identify the waves.
measure = cases
smeasure = 'Week window' # smoothed
rmeasure = 'rcases' # remaining
pmeasure = 'Model' # predicted wave
wmeasure = 'Wave ' # waves
### For each region, the loop below derives the week-window smoothing, the remaining cases, and the fitted corona waves.
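Before the loop, a note on why it works: for a Gumbel curve, pdf/cdf = exp(-(x-mu)/beta)/beta, so log(daily / cumulative) is approximately linear in x with slope -1/beta and intercept mu/beta - log(beta). That is exactly how the loop recovers beta and mu from the fitted segment. A hypothetical check on synthetic data (all numbers invented, `np.polyfit` standing in for the Earth segment fit):

```python
import numpy as np

# Synthetic Gumbel-shaped wave: 10000 cases, beta=15, mu=60 (invented values).
beta_true, mu_true = 15.0, 60.0
days = np.arange(0, 150)
expon = -((days - mu_true) / beta_true)
daily = 10000 * np.exp(expon) * np.exp(-np.exp(expon)) / beta_true  # scaled PDF

# log(daily / cumulative) approximates log(pdf / cdf), which is linear in x.
linear = np.log(daily / np.cumsum(daily))
mask = (linear < -0.5) & (linear > -3.5)     # same window as the loop below
slope, intercept = np.polyfit(days[mask], linear[mask], 1)

# Recover the parameters the same way as the loop: beta from the slope,
# mu from the intercept.
beta_est = -1 / slope
mu_est = beta_est * (intercept + np.log(beta_est))
print(beta_est, mu_est)  # roughly beta_true and mu_true (discretisation bias remains)
```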
for region in regions:
    ### Fit the waves region by region.
    wave = 1
    ### Initialise the per-region columns.
    df_geo[(pmeasure, region)] = 0
    df_geo[(smeasure, region)] = df_geo[measure][region].loc[:lastdate].rolling(7).mean()
    df_geo[(rmeasure, region)] = df_geo[smeasure][region]
    plotlist = [(smeasure, region), (pmeasure, region)]
    #countryname = df[df['geoId'] == country]['countriesAndTerritories'].iloc[0]
    #popdata = df[df['geoId'] == country]['popData2019'].iloc[0]
    #mincases = popdata / 1e6
    mincases = 2
    #mincases = df_geo[smeasure][country].sum() / 5000
    #mincases = max(popdata / 1e6, 10)
    print('Running multiple wave analysis for \'{}\''.format(region))
    print('Minimum number of cases is {:1.0f}'.format(mincases))
    ## Wave analysis loop; I got help from an outside reference for this part.
    while True:
        curwave = wmeasure + str((wave) + 1000)[-2:]
        df_geo[(curwave, region)] = 0
        df_pred = pd.DataFrame({'daynum': df_geo['daynum'],
                                measure: df_geo[rmeasure][region]})
        df_pred['gumdiv'] = df_pred[measure] / df_pred[measure].cumsum()
        df_pred = df_pred[(df_pred['gumdiv'] > 0) & (df_pred[measure] > mincases)]
        df_pred['linear'] = np.log(df_pred['gumdiv'])
        df_pred = df_pred[(df_pred['linear'] < -0.5) &
                          (df_pred['linear'] > -3.5)]
        if len(df_pred) <= 1:
            print('--- no data left')
            break
        ### Enough data is left: fit the log-linearised daily occurrences.
        eax = df_pred['daynum'].values.reshape(-1, 1)
        eay = df_pred['linear'].values.reshape(-1, 1)
        #eamodel = Earth()
        #eamodel = Earth(minspan=0)
        eamodel = Earth(minspan=1, penalty=0, endspan=0, thresh=1e-9, check_every=1)
        eamodel.fit(eax, eay)
        df_pred['earth'] = eamodel.predict(eax)
        ## daymin is the minimum day number of the fitted range.
        ## daymax is the maximum day number of the fitted range.
        daymin = df_pred['daynum'].min()
        daymax = df_pred['daynum'].max()
        #df_pred['gbgrad'] = np.gradient(df_pred['linear'])
        #df_pred['eagrad'] = np.gradient(df_pred['earth'])
        # Simple one-day differences suit the graph better here than np.gradient.
        df_pred['gbgrad'] = df_pred['linear'] - df_pred['linear'].shift(1)
        df_pred['eagrad'] = df_pred['earth'] - df_pred['earth'].shift(1)
        fitmod = export.export_python_function(eamodel)
        ## Export the fitted Earth model as a plain Python function.
        df_pred['knot'] = ((abs(df_pred['eagrad'] - df_pred['eagrad'].shift(1)) > 1e-6) |
                           (df_pred['daynum'] == (daymin + 1)) |
                           (df_pred['daynum'] == daymax))
        df_pred['daycount'] = df_pred.reset_index().index
        df_knot = df_pred[df_pred['knot']][['daynum', 'daycount', 'eagrad']]
        df_knot['daysdata'] = df_knot['daycount'].shift(-1) - df_knot['daycount']
        df_knot['daystime'] = df_knot['daynum'].shift(-1) - df_knot['daynum']
        df_knot['cand'] = ((df_knot['eagrad'] < -1/77) &
                           (df_knot['daysdata'] >= 3))
        df_knot['since'] = df_knot['daynum'] - daymin
        df_knot['score'] = (df_knot['eagrad'] ** 2) * np.sqrt(df_knot['daysdata'] / np.sqrt(df_knot['since']))
        df_knot['choice'] = df_knot['score'] == df_knot[df_knot['cand']]['score'].max()
        choice = df_knot[df_knot['choice']]
        if len(choice) == 0:
            print('--- no data for wave')
            break
        lower = choice['daynum'].values[0]
        upper = choice['daysdata'].values[0] + lower
        df_pred = df_pred[(df_pred['daynum'] >= lower) &
                          (df_pred['daynum'] <= upper)].copy()
        slope = (fitmod([[upper]])[0] - fitmod([[lower]])[0]) / (upper - lower)
        intercept = fitmod([[lower]])[0] - (lower * slope)
        ### The slope matters: the chosen segment is a linear regression here.
        beta = - 1 / slope
        mu = beta * (intercept + np.log(beta))
        df_pred['pgumb'] = gumpdf(df_pred['daynum'], beta, mu)
        df_pred['scale'] = df_pred[measure] / df_pred['pgumb']
        final = df_pred['scale'].mean()
        fincv = df_pred['scale'].std() / final
        ### fincv is the coefficient of variation of the scale factors
        ### (std divided by the mean), used as a fit-quality measure.
        df_geo[(curwave, region)] = final * gumpdf(df_geo['daynum'], beta, mu)
        peak = df_geo[df_geo[(curwave, region)] == df_geo[(curwave, region)].max()].index.min()
        start = df_geo[(df_geo[(curwave, region)] >= 1) &
                       (df_geo[(curwave, region)].index < peak)].index.min()
        floor = df_geo[(df_geo[(curwave, region)] < 1) &
                       (df_geo[(curwave, region)].index > peak)].index.min()
        ### beta and mu fit the peak for waves of different sizes.
        ### Report the fitted wave, its peak, start and end dates, and its size.
        print('{} beta {:6.3f} mu {:3.0f} fit {:5.3f} peak {} from {} to {} size {:1.0f}'.format(
            curwave, beta, mu, (1 - fincv) ** 2, peak.date(), start.date(), floor.date(), final))
        df_geo[(pmeasure, region)] += df_geo[(curwave, region)]
        df_geo[(rmeasure, region)] -= df_geo[(curwave, region)]
        plotlist += [(curwave, region)]
        wave += 1
    if showplots:
        df_geo[plotlist].loc['20200101':'20210404'].plot(
            figsize=(16, 9),
            grid=True,
            kind='area',
            stacked=False,
            alpha=1/3,
            title='Daily new cases for ' + region,
            colormap="Dark2")
        df_geo[plotlist].loc['20200101':'20210404'].cumsum().plot(
            figsize=(16, 9),
            grid=True,
            kind='area',
            stacked=False,
            alpha=1/3,
            title='Cumulative cases for ' + region,
            colormap="magma")
The plots show the cases per week window. In April 2020 the number of corona cases was still at a micro level; it started to skyrocket in January 2021. Coming up to April 2021, even with summer approaching, the dataset shows cases surging again around week 12. From here we should take a probabilistic approach to understand properly how the situation is developing.
# !wget -N https://data.rivm.nl/covid-19/COVID-19_casus_landelijk.csv
import pandas as pd
import numpy as np
df_case = pd.read_csv(
'COVID-19_casus_landelijk.csv',
sep=';',
parse_dates=[0, 1],
infer_datetime_format=True)
df_case.tail(10)
## input by pandas
# df_case_spark = spark.read.options(header='True', inferSchema='True', delimiter=';').csv('COVID-19_casus_landelijk.csv')
# # df_case_spark.show(10)
# from pyspark.sql.types import *
# df_case_spark = df_case_spark.withColumn("Date_file",
# df_case_spark["Date_file"].cast(DateType()))
# df_case_spark = df_case_spark.withColumn("Date_statistics",
# df_case_spark["Date_statistics"].cast(DateType()))
df_case.info()
# df_case_spark.printSchema()
## Alternative: read the input as big data with Spark.
# df_case['period'] = df_case['Date_statistics'].apply(lambda x: x.isocalendar()[1])* 100 + apply(lambda x: x.isocalendar()[1])
# df_case['period']
df_case['period'] = df_case['Date_statistics'].apply(lambda x: x.isocalendar()[0])* 100 + df_case['Date_statistics'].apply(lambda x: x.isocalendar()[1])
df_case['period'] = df_case['period'].apply(str)
df_case['period'] = df_case['period'].apply(lambda x: x[: 4] + 'W' + x[4:])
df_case['group'] = df_case['Sex'] + ' ' + df_case['Agegroup']
df_case.tail(10)
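A quick check of the period encoding above, on one illustrative date (not taken from the data): ISO year times 100 plus ISO week, rendered as 'YYYYWww'.

```python
import pandas as pd

# Mirror the apply chain above for a single timestamp.
ts = pd.Timestamp('2021-01-28')
year, week, _ = ts.isocalendar()
period = str(year * 100 + week)
period = period[:4] + 'W' + period[4:]
print(period)  # → 2021W04
```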
data_df_heat_map = df_case[df_case['Municipal_health_service'] != ''].pivot_table(
index='period',
columns='group',
values='Date_statistics',
aggfunc='count').fillna(0)
# Select columns to use, optionally subset or use relative numbers
#data_df_heat_map['total'] = data_df_heat_map[data_df_heat_map.columns[0:24]].sum(axis=1)
data_df_heat_map = data_df_heat_map[data_df_heat_map.columns[0:24]]
# Optionally switch to relative growth numbers
# data_df_heat_map = data_df_heat_map / data_df_heat_map.shift()
data_df_heat_map.tail(5).loc[::-1].transpose()
Split by males and females per age group, this table shows how the week-to-week number of cases increases.
## Define arrays of row and column headers
durationsweekly = data_df_heat_map.index
agegroups = data_df_heat_map.columns
## Figure size chosen to match the data size and length
fig, ax = plt.subplots(figsize=(26, 12))
heatmap = plt.imshow(
    np.log(data_df_heat_map[data_df_heat_map > 0].loc[:].transpose()),
    cmap='Accent',
    interpolation='None',
    aspect='auto',
    origin='lower')
# Use the period and group values as axis tick labels
ax.set_xticks(np.arange(len(durationsweekly)))
ax.set_yticks(np.arange(len(agegroups)))
ax.set_xticklabels(durationsweekly)
ax.set_yticklabels(agegroups)
# Rotate the x labels diagonally
plt.setp(
    ax.get_xticklabels(),
    rotation=45,
    ha="right",
    rotation_mode="anchor")
# Convert the dataframe to a NumPy array
np_heat = data_df_heat_map.to_numpy()
# Annotate each cell with its count as a text label
for i in range(len(durationsweekly)):
    for j in range(len(agegroups)):
        text = ax.text(
            i,
            j,
            int(np_heat[i, j]),
            ha="center",
            va="center",
            color="w")
ax.set_title("Positive tests weekly, on sex and age group")
fig.tight_layout()
plt.show()
We can see that case numbers rose sharply once the pandemic established itself in 2021.
This personally supports the project: every tiny cell actively shows its number, and the heatmap makes the impact of the cases visible.
import plotly.graph_objects as go
import plotly.express as px
import numpy as np
# The weekly durations can be considered as the period of time.
durationsweekly = data_df_heat_map.index
agegroups = data_df_heat_map.columns
# df = data_df_heat_map
fig = px.imshow(np.log(data_df_heat_map[data_df_heat_map > 0].loc[:].transpose()))
# Note: the matplotlib annotation loop from the previous cell does not apply
# here, since ax belongs to the matplotlib figure, not to this Plotly figure.
# fig['layout']['yaxis']['autorange'] = "reversed"
# iplot(fig)
fig.show()
## Plot the heat map interactively for this real case.
2021 shows extreme case numbers, with the top alert peaks in the Netherlands; we can see how the heat rises towards the current time.
The density of new cases from 2021W04 to 2021W12 decreases compared with week 14. The longer the pandemic lasts, the bigger the patient numbers get, which is why we see massive points around the new year, even though some corona measures had already started and new restrictions have now begun in the Netherlands. On the other side of this exploratory data analysis (EDA), we still need good insight into how measures worldwide are working right now; the measurement method follows in the next notebook. So we can learn which measures suit the situation in the Netherlands, after which maps and other work can be added to this graph.